The Hebrew University of Jerusalem

Candes and Tao, in an impressive and innovative paper, introduce an 
ingenious estimator. Their discussion brings the standard $\ell_2$ loss function
back into the main focus of the "large p, small[er] n" discussion. We wish to
present in this comment an apologia for using the prediction error criterion 
as the way to gauge the quality of the estimator in large-dimension models. 
This is not, however, a postmodernist essay on a cultural aspect of statistics. 
For this the reader may refer to the challenging discussion in Breiman ([1]). 
Our discussion is within the boundaries of the standard decision theory as 
applied to complex parameters. 

The setup we consider is the standard structural point of view of regression 
(see Greenshtein and Ritov [2] for details). We observe an i.i.d. sample from 
the pair (y,x), where x is a p-dimensional random vector, while y is real. 
We may, but do not need to, assume the linear structure $y = x'\beta_0 + z$, where
the random variable $z$ is independent of $x$. The informal objective is to find
a good estimator of $\beta_0$, where $\beta_0$ can be defined by $x$ being
uncorrelated with the residuals $y - \beta_0' x$. At this stage of the discussion,
no estimator can be said to be the best, since for that we would have to agree on
an exact criterion. Unlike the situation with simple parametric models, an
estimator may be asymptotically efficient when a specific risk function is
considered, and not so if another criterion is applied.

The data we consider is $\mathcal{D}_n = \{(y_1, x_1), \dots, (y_n, x_n)\}$, a simple random
sample from the distribution of $(y, x)$. We compute $\hat\beta = \hat\beta_n(\mathcal{D}_n)$. The
prediction criterion compares $\hat\beta$ and $\beta_0$ not by a direct loss function, for example,
$\ell_2(\hat\beta, \beta_0) = \|\hat\beta - \beta_0\|_2^2$, but indirectly by comparing the theoretical optimal
$E(y - \beta_0' x)^2$ to the prediction performance of the estimator, $E((y - \hat\beta' x)^2 \mid \mathcal{D}_n)$.
The expectation is taken over the distribution of $(y, x)$ from which $\mathcal{D}_n$ is
sampled. However, the prediction inefficiency is given by

$$
\begin{aligned}
\mathrm{PIE}_2(\hat\beta) &= E((y - \hat\beta' x)^2 \mid \mathcal{D}_n) - E(y - \beta_0' x)^2 \\
&= (\hat\beta - \beta_0)' \Sigma_x (\hat\beta - \beta_0) \\
&= \|\hat\beta - \beta_0\|^2_{\Sigma_x}.
\end{aligned}
$$
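
The second equality in the display above can be checked in one step. The sketch below assumes that $x$ is uncorrelated with the residual $y - \beta_0' x$, that the new pair $(y, x)$ is independent of the sample, and that $\Sigma_x$ denotes $E(xx')$ (which coincides with the covariance matrix when $x$ is centered; the centering is an assumption of this sketch, not something stated in the text):

```latex
\begin{align*}
E\bigl((y - \hat\beta' x)^2 \mid \mathcal{D}_n\bigr)
  &= E\Bigl(\bigl((y - \beta_0' x) - (\hat\beta - \beta_0)' x\bigr)^2 \Bigm| \mathcal{D}_n\Bigr) \\
  &= E(y - \beta_0' x)^2
     - 2(\hat\beta - \beta_0)' E\bigl(x (y - \beta_0' x)\bigr)
     + (\hat\beta - \beta_0)' E(x x') (\hat\beta - \beta_0) \\
  &= E(y - \beta_0' x)^2 + (\hat\beta - \beta_0)' \Sigma_x (\hat\beta - \beta_0).
\end{align*}
```

The cross term vanishes precisely because of the uncorrelatedness assumption, so no linearity of the regression function is needed.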

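Now, let $\iota(\beta_0, \hat\beta) = \{i : |\hat\beta_i| + |\beta_{0i}| > 0\}$ and $\Sigma_x(\iota) = (\mathrm{cov}(x^i, x^j))_{i,j \in \iota}$, where
$x = (x^1, \dots, x^p)'$. With this notation

$$
\mathrm{PIE}_2(\hat\beta) = \|\hat\beta - \beta_0\|^2_{\Sigma_x(\iota)},
$$

and under the assumptions of Candes and Tao the $\mathrm{PIE}_2$ and $\ell_2$ criteria are
comparable.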
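
As a numerical illustration of the prediction inefficiency, the following sketch compares the quadratic form in the full covariance matrix, the same form restricted to the coordinates where either coefficient vector is nonzero, and a Monte Carlo estimate of the excess prediction error. Everything in it is hypothetical: the covariance matrix, the two coefficient vectors, the Gaussian design and the variable names are chosen only for illustration and are not taken from Candes and Tao.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 6

# Hypothetical population: centered Gaussian x with covariance sigma_x.
a = rng.normal(size=(p, p))
sigma_x = a @ a.T + np.eye(p)

# Hypothetical true coefficients and a stand-in for an estimate of them.
beta0 = np.array([1.0, -2.0, 0.0, 0.0, 0.5, 0.0])
beta_hat = np.array([0.8, -1.5, 0.0, 0.3, 0.0, 0.0])

# Quadratic form in the full covariance matrix.
d = beta_hat - beta0
pie_full = float(d @ sigma_x @ d)

# Same quadratic form restricted to indices where either vector is nonzero.
iota = np.flatnonzero(np.abs(beta_hat) + np.abs(beta0) > 0)
d_iota = d[iota]
pie_restricted = float(d_iota @ sigma_x[np.ix_(iota, iota)] @ d_iota)

# Monte Carlo estimate of the excess prediction error on fresh data,
# with noise z independent of x (the linear structure is assumed here).
n_test = 200_000
x = rng.multivariate_normal(np.zeros(p), sigma_x, size=n_test)
y = x @ beta0 + rng.normal(size=n_test)
pie_mc = float(np.mean((y - x @ beta_hat) ** 2) - np.mean((y - x @ beta0) ** 2))

print(pie_full, pie_restricted, pie_mc)
```

The first two numbers agree up to rounding, since coordinates outside the index set contribute nothing to the difference of the coefficient vectors; the third agrees only up to Monte Carlo error.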

What are the $p$ regressors $x^1, \dots, x^p$? There are two main possibilities. The
first is that they may be $p$ genuine explanatory variables, representing different
measurements. Thus, in a particular investigation they can be, among other 
things, height, weight, income, socioeconomic status, gender, the number 
of visits to the supermarket, and so on. In the second extreme situation
we start with very few explanatory variables (typically one), $u$, and
the linear regression problem is defined in terms of $x^i = \psi_i(u)$, $i = 1, \dots, p$.
This is the situation we face in standard nonparametric regression techniques 
using wavelet techniques or cubic splines with fixed nodes, or in classification 
techniques like SVM (support vector machine) (where x is defined implicitly
using the "kernel trick"). 

When we are faced with the structural model with many different variables 
representing conceptually different properties (height and income, e.g.), it 
may make sense to assume that they are normal, but it is very unlikely that 
they are independent. Any strong assumption on the huge $n \times p$ matrix is
hard to conceive. Just think about a $1000 \times 10{,}000$ matrix! Furthermore, the
loss function $\ell_2$, which makes sense in low dimensions, makes, by itself, little
sense in high-dimensional spaces. In many cases it represents the average
error of many different estimators of different quantities. The vector $\beta$ as
a vector has very little meaning. It is just a collection of parameters. One 
may be interested in the impact of a single variable on the outcome (e.g., 
the number of previous visits to the supermarket), or of a small group of 
variables (e.g., representing the socioeconomic level), but no one has a simple 
interpretation for an eclectic list of 100 parameters. The only reason to
consider the loss $p^{-1}\|\hat\beta - \beta_0\|^2$ is that it is the mean of the individual
squared errors. However, there is no apparent reason why it is the arithmetic 
mean that should be taken. The prediction error is an "objective" way to 
find the right weighted mean of the individual errors, and it is strongly 
adapted to the particular situation at hand. 
Suppose the relevant submatrix of the hat matrix, the one that is
related to the active variables, has a high condition number. The estimation
problem is ill-posed, while the prediction problem is well-defined. If there is
collinearity, we may not be able to verify which one of the variables has a
causal effect, and what is the proper value of each coefficient, but we can 
very well infer the aggregate impact of a group of variables. More problem- 
atic is the meaning of the proper value of a parameter. This can be defined 
as the population value. However, within the context of the "large p, small 
n problem," the model is necessarily defined with respect to the sample size. 
The right model is the best that can be estimated with the given resources. 
With a larger sample size, we may want to use a completely different set of 
variables. 
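
The contrast between ill-posed estimation and well-defined prediction can be seen in a small simulation. The sketch below is only an illustration under assumed settings: two nearly collinear Gaussian regressors (correlation 0.999), unit-variance noise, ordinary least squares as the estimator, and hypothetical names throughout. It averages the squared coefficient error and the prediction inefficiency over repeated samples.

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho, n_rep = 200, 0.999, 500

# Hypothetical population covariance of two nearly collinear regressors.
sigma_x = np.array([[1.0, rho], [rho, 1.0]])
beta0 = np.array([1.0, 1.0])

def one_fit(rng):
    """Draw one sample, fit least squares, return (l2 loss, prediction inefficiency)."""
    x = rng.multivariate_normal(np.zeros(2), sigma_x, size=n)
    y = x @ beta0 + rng.normal(size=n)
    beta_hat, *_ = np.linalg.lstsq(x, y, rcond=None)
    d = beta_hat - beta0
    return d @ d, d @ sigma_x @ d

losses = np.array([one_fit(rng) for _ in range(n_rep)])
mean_l2, mean_pie = losses.mean(axis=0)
cond = np.linalg.cond(sigma_x)  # condition number of the population covariance

print(cond, mean_l2, mean_pie)
```

In this setup the individual coefficients are poorly determined, so the average squared coefficient error is far larger than the average prediction inefficiency, which stays on the order of $p/n$ regardless of the conditioning.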

Consider now the other extreme in which the problem at hand is a
nonparametric regression of $y$ on a univariate random variable $u$ whose
distribution is unknown. The random vector $x$ is then $(\psi_1(u), \dots, \psi_p(u))'$, where
$\psi_1, \psi_2, \dots$ is some basis of $L_2$. The assumption that the components of $x$ are
normal now seems unreasonable. The assumption that the components are
independent, or even uncorrelated, is very strong (at best, the $\psi_1, \psi_2, \dots$ are
orthonormal with respect to some a priori measure, e.g., Lebesgue, not the
distribution of $u$). So, it is hard to see how much regularity can be assumed
for the design matrix. Let $f_0(u) = \sum_{j=1}^p \beta_{0j} \psi_j(u)$ and $\hat f(u) = \sum_{j=1}^p \hat\beta_j \psi_j(u)$.
Then $\|\hat\beta - \beta_0\|^2 = \int (\hat f(u) - f_0(u))^2 \, du$ (assuming that the basis functions
are Lebesgue orthonormal). This does make sense, but the prediction loss
$\int (\hat f(u) - f_0(u))^2 \, dF_u(u)$ is still more reasonable.
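
The gap between the two integrals can be made concrete with a small numerical sketch. The choices below are assumptions made only for illustration: a cosine basis on $[0, 1]$ that is orthonormal with respect to Lebesgue measure, a covariate $u$ with density $2u$ on $[0, 1]$, two hypothetical coefficient vectors, and midpoint-rule integration on a fine grid.

```python
import numpy as np

p, m = 5, 200_000
u = (np.arange(m) + 0.5) / m  # midpoint grid on [0, 1]

def cosine_basis(u, p):
    """Columns 1, sqrt(2) cos(pi u), ..., orthonormal w.r.t. Lebesgue on [0, 1]."""
    cols = [np.ones_like(u)]
    cols += [np.sqrt(2.0) * np.cos(j * np.pi * u) for j in range(1, p)]
    return np.column_stack(cols)

psi = cosine_basis(u, p)

# Hypothetical true and estimated coefficients.
beta0 = np.array([1.0, 0.5, 0.0, -0.3, 0.0])
beta_hat = np.array([0.9, 0.7, 0.0, -0.3, 0.0])
diff = psi @ (beta_hat - beta0)  # estimated minus true function on the grid

l2_loss = float(np.sum((beta_hat - beta0) ** 2))
lebesgue_loss = float(np.mean(diff ** 2))            # integral against du
density = 2.0 * u                                    # assumed density of u
prediction_loss = float(np.mean(diff ** 2 * density))  # integral against dF_u

print(l2_loss, lebesgue_loss, prediction_loss)
```

The first two quantities coincide because the basis is Lebesgue orthonormal, while the third weights the squared error by where $u$ actually falls and therefore differs from them.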

However, restricting ourselves to orthonormal series (even with respect 
to the Lebesgue measure) may be too extreme. We want to invoke sparsity, 
which is essential to the large p analysis, and sparsity depends on finding 
the proper $\psi_i$s. The same function may be sparse in a given representation
(e.g., the $\psi_i$s are step functions or ramps) and not in other representations
(e.g., when they are the Haar basis functions or step functions, resp.). Then
$\|\hat\beta - \beta_0\|^2$ makes very little sense in terms of the estimated function $\hat f$, but
the prediction error is still exactly what one needs. 